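This document is an exploration of the Prosper Loan Data, which contains 113,937 observations across 81 variables. Not all variables will be explored. The structure listing below reports 82 variables because it includes a derived region column in addition to the 81 original fields.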
str(loan_data)
## 'data.frame': 113937 obs. of 82 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$100,000+",..: 4 5 7 4 2 2 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
## $ region : chr "colorado" "colorado" "georgia" "georgia" ...
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2005 2008 2012 2011 2013 2014
With this histogram, I’m trying to get a sense of the origination volume by year. I was curious when the company began so that I could tell whether this data set covers the life of the business. I found a Wikipedia article indicating that the company opened to the public on February 5th, 2006, so it’s slightly odd that there are 16 loans with LoanOriginationDate values in 2005.
Let’s take a finer grain look at the originations by year and quarter.
We need to adjust the factor levels for LoanOriginationQuarter since they are not in chronological order.
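A minimal sketch of that reordering, using a handful of toy quarter labels rather than the real column. The names quarters and chronological are illustrative, and the approach assumes every label has the form "Q<n> <yyyy>" as in the structure listing above.

```r
# Toy stand-in for loan_data$LoanOriginationQuarter (illustrative values).
quarters <- factor(c("Q1 2006", "Q4 2005", "Q2 2007", "Q1 2007", "Q3 2006"))

# By default the levels sort as text, so every Q1 comes before every Q2.
levels(quarters)

# Pull the quarter number and the year out of each label,
# then order the labels by year first and quarter second.
lvl <- levels(quarters)
qtr <- as.integer(substr(lvl, 2, 2))
yr  <- as.integer(substr(lvl, 4, 7))
chronological <- lvl[order(yr, qtr)]

# Rebuild the factor with the chronological level order; the values are unchanged.
quarters <- factor(quarters, levels = chronological)
levels(quarters)
```

With the levels in this order, a bar chart of counts by quarter reads left to right in time.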
There are two interesting observations about this histogram. The first is the complete drop off in originations in Q2 2009 (there isn’t any data for Q1 2009). My initial thought was that this reflected the impact of the financial crisis of 2007-2008, but I would not expect originations to drop to zero.
In reading the Wikipedia article mentioned previously, it seems that this gap in originations was due to a lawsuit. It still may be the case that the financial crisis contributed to the drop in volume. If we had time-series data we could look for deteriorating performance (i.e. increasing delinquencies) during the financial crisis.
In December 2010, Prosper made a change to its business model. Prior to that date, interest rates were set by investors through a Dutch-style auction. The current business model has fixed-rate loans with the rate set by an internal loan pricing and scoring algorithm. Any analysis should probably treat these as two separate data sets, but first let’s see if we can find the actual point at which Prosper scores came into use.
The field description file says the ProsperScore is applicable to loans originated after July 2009. The graph above confirms that information.
One potential avenue of investigation would be to look at the characteristics of the borrowers from these two subsets of the data. Did the introduction of the ProsperScore coincide with tighter lending criteria (higher credit scores, lower debt to income)?
Let’s split the data into loans originated prior to the introduction of Prosper scores and those originated after.
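One way the split could be done is sketched below on toy rows. The rule used here (a loan is in the "after" group when its ProsperScore is present) is an assumption and not necessarily how the flag in this analysis was built. The data frame loans and the names pre_score and post_score are illustrative.

```r
# Toy rows standing in for loan_data; the values are illustrative only.
loans <- data.frame(
  LoanOriginationDate = as.Date(c("2007-03-01", "2008-09-15",
                                  "2009-08-01", "2013-05-20")),
  ProsperScore        = c(NA, NA, 7, 4)
)

# Assumption: a loan was originated after the score's introduction
# exactly when it carries a ProsperScore.
loans$prosper_score <- !is.na(loans$ProsperScore)

pre_score  <- loans[!loans$prosper_score, ]
post_score <- loans[loans$prosper_score, ]

# Sanity check: the latest unscored origination should come before
# the earliest scored one if the flag really separates two periods.
max(pre_score$LoanOriginationDate) < min(post_score$LoanOriginationDate)
```

On the real data the sanity check is worth running, since a handful of later loans with a missing score would end up in the wrong subset under this rule.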
I’m very surprised the Term of the loans is so concentrated at 36 months. I was expecting that the Terms would be on the shorter side (<= 60 months), but much more diverse. Let’s see if the BorrowerRate is similarly concentrated.
The distribution of interest rates is slightly skewed to the right and also has some large spikes in the upper tail of the distribution. These spikes may reflect some sort of maximum allowable rate that is either set internally or prescribed by regulations. There are also a few outliers with rates above 40%, but these are all prior to the introduction of the Prosper Score.
The maximum borrower rate declined nearly 5% from 2011 to 2014. This could reflect overall market conditions and/or reflect higher credit standards for borrowers following the introduction of the Prosper Score.
Interesting that there are loans with a zero BorrowerRate. Let’s look at a summary of those loans.
##
##     A AA  B  C  D  E HR NC
##  0  1  1  0  2  1  0  3  0
The CreditGrade is not what one would expect for 0% interest rate loans. Since there are only 8 loans it isn’t going to impact the analysis.
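The counts printed above have one more value than there are visible labels because the first CreditGrade level is the empty string. The sketch below rebuilds the eight grades from those counts (the real rows are not reproduced here, and zero_rate is an illustrative name) and shows how such a table is produced.

```r
# CreditGrade levels as listed in the structure output; the first is "".
grade_levels <- c("", "A", "AA", "B", "C", "D", "E", "HR", "NC")

# Eight zero-rate loans reconstructed from the printed counts.
zero_rate <- data.frame(
  BorrowerRate = 0,
  CreditGrade  = factor(c("A", "AA", "C", "C", "D", "HR", "HR", "HR"),
                        levels = grade_levels)
)

# table() reports every level, including the unused "" level.
grade_counts <- table(zero_rate$CreditGrade)
grade_counts
sum(grade_counts)
```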
## Warning: Removed 131 rows containing non-finite values (stat_count).
It looks like the loans are distributed across the rating categories. When we move on to the multivariate analysis we’ll want to see how these rating categories correlate to other attributes.
The bulk of the distribution lies between 0 and 20% estimated loss, though there are a few outliers. The estimated loss of the most extreme outliers is 36.6%, which is higher than the Borrower Rate on the loans. The estimated return on these loans is negative.
The majority of loans fall into the Debt Consolidation category. It is surprising how many are labelled as Not Available. Given the plethora of categories available, I would be wary of a borrower who lists under this category. We’ll see if Prosper investors lend at higher rates to these borrowers.
## # A tibble: 52 x 2
##    BorrowerState loans
##           <fctr> <int>
##  1            CA 14717
##  2            TX  6842
##  3            NY  6729
##  4            FL  6720
##  5            IL  5921
##  6                5515
##  7            GA  5008
##  8            OH  4197
##  9            MI  3593
## 10            VA  3278
## # ... with 42 more rows
I guess it’s not too surprising that a California-based company would do most of its business in California. The 5,515 records with missing BorrowerState values are troubling. The borrower’s state should be required information.
These are strange categories in that they don’t seem mutually exclusive (e.g. a borrower can be both Employed and Full-time). Does Not employed mean unemployed? It’s also unclear what Other could mean in this context. Not available is also peculiar; I’m not sure I would lend to someone who says their Employment Status isn’t available.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 26.00 67.00 96.07 137.00 755.00 7625
The EmploymentStatusDuration distribution is skewed to the right, as shown in the graph and by the mean (96.07) being significantly larger than the median (67). The summary statistics suggest that the typical borrower is in their 20s-30s. Looking at the length of the credit history may add evidence for that claim.
The portfolio is roughly split between homeowners and non-homeowners. It will be interesting to see if this affects the borrower rate when we look at the bivariate analysis.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 660.0 680.0 685.6 720.0 880.0 591
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 600.0 660.0 700.0 699.4 720.0 880.0
The CreditScoreRangeLower distribution is relatively normal with the interquartile range squarely in the Fair and Good credit categories. It is interesting to note that after the introduction of the Prosper Score there are no longer any loans with Credit Scores below 600. This is consistent with our previous idea that credit standards were tightened when lending resumed in Q3 2009.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1947 1990 1995 1994 2000 2012 697
You must be 18 years old to establish credit. The median of the FirstRecordedCreditLine is 1995, so assuming most people choose to take credit soon after turning 18, and given that our loan data is from the period 2006 through 2014, our borrowers are roughly 18 years old + (2006-1995) = 29 to 18 + (2014-1995) = 37. This is a little higher than I would have thought based solely on the Employment Duration, but still consistent with the idea that the core customer of Prosper tends to be younger.
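The age estimate above can be written out as a short calculation. It rests on the same assumption as the text, namely that borrowers open their first credit line at about 18. The variable names are illustrative.

```r
median_first_credit_line <- 1995          # median year from the summary above
age_at_first_credit      <- 18            # assumption stated in the text
origination_years        <- c(2006, 2014) # first and last origination years used

# Age at origination = age at first credit line + years elapsed since then.
implied_age <- age_at_first_credit + (origination_years - median_first_credit_line)
implied_age
```

If borrowers typically open their first credit line later than 18, the whole range shifts up by the same number of years.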
The distribution of DebtToIncomeRatio is fairly normal when looking at the region between 0 and 1, which is where the vast majority of the observations lie. There are still a number of outliers. In the bivariate analysis I would expect these higher DTI loans to have higher Borrower Rates.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.1200 0.2000 0.3242 0.3000 10.0100 1247
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.150 0.220 0.259 0.320 10.010 7307
The majority of borrowers can verify their income. I would expect this feature to be associated with the Borrower Rate. On average we should expect that borrowers who cannot verify their income would get higher interest rates, all other things being equal.
As expected, all of the Inactive loans are either Completed, Cancelled, Chargedoff or Defaulted. At the time this data set was pulled there was very little delinquency.
Let’s dig a little deeper into the Chargedoff and Defaulted categories.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 10.00 16.00 18.03 24.00 51.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 9.00 13.00 14.71 19.00 41.00
Interestingly, loans default sooner on average after the introduction of the Prosper Score (a mean of 14.71 months versus 18.03), and the distribution is narrower. Of the borrowers who default on loans originated with a Prosper Score, the middle 50% do so in a 10-month window between months 9 and 19.
This feature represents the number of credit accounts opened within the last 6 months. This distribution is highly skewed, with the mode of the distribution being 0. Other things being equal, I would expect those borrowers who have non-zero values for this feature to have higher interest rates.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 0.802 1.000 20.000 7544
The dataset consists of 113,937 loans with 81 variables.
When looking at financial products such as loans the most important features are likely to be related to the credit worthiness of the borrower.
For example, Prosper Score, Credit Score Range Lower or Upper, Income Range, Employment Status, Debt to Income Ratio, and BorrowerState are variables of interest that may help understand the ProsperScore and the likelihood of repayment.
Variables such as Listing Category, Current Delinquencies, and Trades Opened Last 6 Months may also provide information about the likelihood of repayment.
I did create a few new variables during the univariate exploration. The first two, prosper_score and active are boolean flags. The field prosper_score indicates if the loan was originated prior to the introduction of the score or after (TRUE = after). Similarly, active is a flag that tags loans according to their LoanStatus (inactive loans are those with LoanStatus in ‘Completed’, ‘Cancelled’, ‘Chargedoff’, ‘Defaulted’). I also created a months_to_default variable to measure the number of months from origination to default for loans in either the defaulted or charged-off Loan Status.
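As an illustration of the active flag described above, here is a sketch on toy rows. The list of inactive statuses comes from the text. The "Current" and "Past Due (1-15 days)" labels are illustrative examples of active statuses, and loans is a stand-in data frame.

```r
# Toy rows standing in for loan_data$LoanStatus (illustrative values).
loans <- data.frame(
  LoanStatus = c("Current", "Completed", "Chargedoff",
                 "Past Due (1-15 days)", "Defaulted", "Cancelled"),
  stringsAsFactors = FALSE
)

# Statuses the text treats as inactive; everything else counts as active.
inactive_statuses <- c("Completed", "Cancelled", "Chargedoff", "Defaulted")
loans$active <- !(loans$LoanStatus %in% inactive_statuses)

table(loans$active)
```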
Yes. I re-ordered the levels on the variables ProsperRating..Alpha., LoanStatus, and LoanOriginationQuarter so that they would reflect an order that made sense when visualized. For example, the ProsperRating..Alpha. variable was re-ordered to go from the best rating to the worst.
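A sketch of the best-to-worst reordering for the rating variable, on toy values. The level order AA, A, B, C, D, E, HR follows the summaries later in this document. Dropping the empty "" level (so unrated loans become NA) and making the factor ordered are illustrative choices, not necessarily what was done here.

```r
# Toy stand-in for loan_data$ProsperRating..Alpha., including one unrated loan.
ratings <- factor(c("A", "HR", "", "C", "AA", "E", "B", "D"))

# Explicit best-to-worst order; "" is left out, so unrated loans become NA.
best_to_worst <- c("AA", "A", "B", "C", "D", "E", "HR")
ratings <- factor(ratings, levels = best_to_worst, ordered = TRUE)

levels(ratings)
sum(is.na(ratings))  # number of unrated loans dropped to NA
```

Plots that use this factor on an axis will then run from the best rating to the worst instead of alphabetically.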
Clearly the strongest correlations are between Prosper Score and Estimated Loss, and between Borrower Rate and Estimated Loss. This makes sense because the Prosper Score is a measure of the riskiness of the borrower and the Borrower Rate should be a reflection of that riskiness. The more likely the borrower is to default on the loan, the more Prosper must charge to compensate for that risk.
Let’s get a look at the Estimated Loss and Borrower Rate.
The relationship between Estimated Loss and Borrower Rate is linear until the estimated losses reach approximately 17%. Beyond that point the Borrower Rate is already above 30%, which means we are near the maximum rates seen earlier. This upper bound on the interest rate will clearly distort the relationship as we can see above.
I’m curious how closely the Estimated Loss correlates to the Prosper Rating categories.
Clearly, the Prosper scoring algorithm is correlated to the Estimated Loss. It appears that there isn’t any overlap in the rating categories with regard to Estimated Loss.
## AA A B C D E HR
## 0.0049 0.0200 0.0410 0.0610 0.0925 0.1225 0.1525
## AA A B C D E HR
## 0.0199 0.0399 0.0599 0.0899 0.1190 0.1490 0.3660
The two summaries above confirm what the boxplot shows, which is that there is no overlap between the Estimated Loss values across the rating categories.
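The no-overlap claim can be checked directly from the printed values. The sketch assumes the first summary is the per-rating minimum of Estimated Loss and the second is the per-rating maximum. The vector names are illustrative.

```r
# Per-rating minimum and maximum Estimated Loss, copied from the output above.
est_loss_min <- c(AA = 0.0049, A = 0.0200, B = 0.0410, C = 0.0610,
                  D = 0.0925, E = 0.1225, HR = 0.1525)
est_loss_max <- c(AA = 0.0199, A = 0.0399, B = 0.0599, C = 0.0899,
                  D = 0.1190, E = 0.1490, HR = 0.3660)

# Gap between each rating's maximum and the next riskier rating's minimum.
n    <- length(est_loss_min)
gaps <- est_loss_min[-1] - est_loss_max[-n]
gaps

# The ranges do not overlap when every gap is positive.
no_overlap <- all(gaps > 0)
no_overlap
```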
Unlike the boxplot of Prosper Rating and Estimated Loss, the number of outliers in the boxplot above and the overlap across rating categories is curious. Why would someone in the highest rating category be paying in excess of 15% interest? This suggests that there are other factors driving the final Borrower Rate, which may be reflected in the Prosper Score.
The boxplot above gives a pretty clear indication that the credit profile of the borrowers shifted with the introduction of the prosper score in mid-2009. The pre-prosper score loans have significantly lower credit scores, with the median being below the median value for all of the Prosper Rating groups (even the HR - High Risk category). It also shows the relationship between the rating categories and the credit score, which is as we would expect. We can also see that category C is the most dense rating category, which is consistent with what we saw in the univariate analysis.
As expected, there is a negative relationship between Borrower Rate and the Credit Score of the borrower. As the credit score of the borrower increases the interest rate decreases.
The relationship between Prosper Score and Prosper Rating is not as strong as I expected. Though the general direction of the relationship is what one would expect, there is quite a lot of variability. Let’s see if there are other features of the data that may help explain the Prosper Score better.
The CreditScoreRangeLower feature would be high on the list of explanatory variables to include in any model of Prosper Score. There is a clear positive correlation between the Credit Score and the Prosper Score. It’s quite possible that the credit score is an input into the Prosper Score algorithm.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.1269 0.1765 0.1871 0.2489 0.3500
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0423 0.1474 0.2015 0.2059 0.2639 0.3600
As expected, borrowers who are homeowners command lower interest rates on average than non-homeowners. It is likely that home ownership is correlated with other features like employment and credit history.
Though not particularly strong, there is a relationship between the number of credit inquiries and the Prosper Score. This makes sense as it may be an indication of the need for credit.
The relationship between Debt to Income Ratio and Prosper Score has many outliers, which I have cut off using coord_cartesian, but it does show a negative correlation and is likely a component of the Prosper Scoring algorithm. It makes sense that a borrower with a high DTI will have less capacity to handle any financial shocks and may therefore be at higher risk of default.
Similar to DTI and InquiriesLast6Months, the strength of the relationship between TradesOpenedLast6Months and Prosper Score is rather weak. It still may be a component of the Prosper Scoring algorithm. Opening a lot of credit accounts could be an indication of impending financial stress.
The strongest relationship is that between the Credit Score and the Prosper Score. These are important pieces of information to understanding the riskiness of a loan. Other features such as the debt to income ratio and recent credit inquiries have negative relationships to the Prosper Score.
I was a little surprised that there wasn’t a clearer relationship between the listing category and the borrower rate. For example, the ‘Not Available’ category has one of the lowest average interest rates whereas loans for the Auto category have a higher average rate. My hypothesis for this is that Prosper investors do not have traditional remedies for default, such as repossession of collateral so there is no longer a distinction between collateralized and uncollateralized lending.
The strongest relationship in the features I investigated was that between the Prosper Score and the CreditScoreRangeLower. This suggests that either Credit Score is used in the formulation of the Prosper Score or that the Prosper Score algorithm shares some similarities with the algorithm used to generate Credit Scores.
This visualization shows that ProsperScore and Credit Score together do an OK job at capturing the riskiness of loans. At each Credit Score level the defaulted loans group tends to have a lower average ProsperScore.
The boxplots above show that loans that default carry higher interest rates on average, though it is interesting that there is a wide range of outliers at each Prosper Score. This is a clear indication that factors other than Prosper Score determine the Borrower Rate. The same higher interest rates on defaulted loans can be seen across the listing categories as well.
The visualization above shows the relationship we would expect based on the statements from the Prosper website. The combination of Prosper Score and FICO does correlate to the Estimated Loss.
This scatterplot supports the idea that defaults occur more frequently for lower credit quality borrowers (lower credit score). It also gives some evidence that higher debt to income ratios play a role in defaults.
From the choropleth above we can see a pretty clear pattern of higher default rates in the Southeast portion of the country. The notable exceptions are South Dakota, which has the highest default rate in the country, and Rhode Island, which has the second highest.
## # A tibble: 6 x 2
## region default_rate
## <chr> <dbl>
## 1 south dakota 0.11111111
## 2 rhode island 0.10243902
## 3 tennessee 0.10194805
## 4 kentucky 0.09898763
## 5 louisiana 0.09857482
## 6 mississippi 0.09777778
## # A tibble: 6 x 2
## region default_rate
## <chr> <dbl>
## 1 hawaii 0.05539359
## 2 colorado 0.05478662
## 3 <NA> 0.05454545
## 4 nebraska 0.05035971
## 5 wyoming 0.04878049
## 6 montana 0.03619910
Prosper Score and CreditScoreRangeLower together show a fairly strong relationship to the EstimatedLoss. This is consistent with information provided on the Prosper website that describes the base estimated loss rate.
I was surprised by the relatively weak relationships in the features I looked at, such as the number of recent credit inquiries and the debt to income ratio. It suggests that perhaps the Prosper Scoring algorithm utilizes many features that each contribute a little. It could also be the case that the Prosper Scoring algorithm relies heavily on the Credit Score of the borrower, since those scores already take into account a range of borrower features.
I did tinker with implementing a logit model to predict default, but am not including it here.
This boxplot groups the 11 Prosper Score values into buckets. Each Prosper Score bucket is broken down into Estimated Loss range buckets. On the y-axis we show the Credit Score. What this boxplot communicates is the positive relationship between the internal Prosper Score and the external Credit Score for the borrowers. The higher Prosper Score buckets contain more loans with lower estimated losses, while the lower Prosper Score buckets contain more loans from the higher estimated loss buckets.
This boxplot shows the relationship of Borrower Rate to Prosper Score and further breaks down the relationship by whether the loan defaulted or not. On the y-axis we show the interest rate of the loan. There is a clear pattern of higher interest rates for lower Prosper Scores. This is as we would expect because the Prosper Score is an indication of the riskiness of the loan and the interest rate is the main mechanism to compensate investors for their level of risk. What is particularly interesting, and an avenue for further investigation, is the number of outliers. There are loans with Prosper Scores of 1 (the worst) with interest rates near those of the highest Prosper Score borrowers. Similarly, there are loans with the highest Prosper Score with interest rates near 20%. Understanding these outliers would probably answer a lot of questions about the Prosper Scoring algorithm. Nonetheless, the expected negative correlation is observed in the data.
While only a basic choropleth, this visualization shows a cluster of higher default levels in the southeastern portion of the U.S. The default rate is measured as the number of loans with Loan Statuses of either Defaulted or Chargedoff in each state as a percentage of all loans originated in that state.
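The default-rate definition above can be sketched in a few lines of base R. The rows below are toy data (only the column names region and LoanStatus match this document), so the resulting rates are illustrative and not the real state figures.

```r
# Toy loans for two states; the statuses are illustrative, not real records.
loans <- data.frame(
  region     = c(rep("georgia", 4), rep("colorado", 5)),
  LoanStatus = c("Defaulted", "Current", "Chargedoff", "Completed",
                 "Current", "Completed", "Current", "Chargedoff", "Current"),
  stringsAsFactors = FALSE
)

# A loan counts toward the default rate if it is Defaulted or Chargedoff.
loans$defaulted <- loans$LoanStatus %in% c("Defaulted", "Chargedoff")

# The mean of a logical vector is the share of TRUE values,
# so this gives defaulted loans / all loans per state.
default_rate <- tapply(loans$defaulted, loans$region, mean)
sort(default_rate, decreasing = TRUE)
```

Because the denominator includes loans that are still current, states with many recent originations will show lower rates than they may eventually reach.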
This exploratory data analysis was challenging. The large number of features available in the dataset made it difficult to choose where to start. It was also challenging because there are many ways to do things in R. These are both good things and I feel that I have come through the other side having gained a lot of valuable practice with R.
My main interest in this dataset is in predicting default, so I would try to dig deeper into the various features that may relate to that outcome. Since the dependent variable is binary, I would think that a logit model could be a decent starting point for any type of modeling. It would also be interesting to acquire time-series data related to these loans, which would allow the next step of predicting not only whether loans default, but when they will default.
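As a rough idea of what that starting point could look like, here is a logit model fitted with glm on simulated data. Nothing below uses the Prosper data: the predictors credit_score and dti, the coefficients used to simulate defaults, and the data frame sim are all made up for illustration.

```r
set.seed(42)
n <- 2000

# Simulated borrowers (not Prosper data): a credit score and a debt to income ratio.
credit_score <- round(rnorm(n, mean = 690, sd = 50))
dti          <- runif(n, min = 0, max = 0.6)

# Made-up "true" relationship: lower scores and higher DTI raise the default odds.
log_odds  <- -2 - 0.02 * (credit_score - 690) + 3 * (dti - 0.3)
defaulted <- rbinom(n, size = 1, prob = plogis(log_odds))
sim       <- data.frame(defaulted, credit_score, dti)

# Logistic regression: binary outcome, logit link.
fit <- glm(defaulted ~ credit_score + dti, data = sim,
           family = binomial(link = "logit"))
coef(fit)

# Predicted default probability for two hypothetical borrowers with the same DTI.
new_borrowers <- data.frame(credit_score = c(640, 740), dti = 0.25)
predicted <- predict(fit, newdata = new_borrowers, type = "response")
predicted
```

On the real data the outcome would be the Defaulted or Chargedoff flag, and loans that are still current would need some thought, since they have not yet had the chance to default.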